ABSTRACT

Sequences of queries to a database system can have 
structure. Recognizing this structure is a kind of "parsing", 
analogous to the parsing of sentences. We present two rather 
different approaches to recognition for exploitation. The first is a 
rule-based system that examines superficial aspects of a query 
sequence to postulate preferences between sets mentioned in the 
queries. The second is a deeper, but more limited model based on 
decision theory, which assigns utilities and suitability probabilities 
to individual set items, and attempts to explain set preferences on 
that basis. Both these methods have disadvantages, and their 
performance is difficult to analyze because of the fuzzy nature of 
the application, but it is hoped they can form the basis for more 
comprehensive man-machine interfaces. 



1. Introduction 

Queries to a database query system usually occur in clusters. Though
discourse-understanding issues such as anaphora and ellipsis are well-studied,
the more general problem of inferring the user plan behind a connected
sequence of queries remains elusive. This seems due to the difficulty of
categorizing and recognizing the many different needs users have. After all,
well-defined tasks tend to be better handled by batch processing; tasks with
vague specifications and goals tend to work better with interactive query
systems.

But we believe some progress can be made towards the goal of figuring out what
users are up to, and adjusting system behavior accordingly. Some promising
nonquantitative research has been done (Hobbs, 1978; Cohen, Perrault, and
Allen, 1981; Reichman-Adar, 1984). Even small successes in understanding can
have immediate payoffs in better management of previous query results, a
crucial aspect of database operations. The usual method of throwing out the
least-recently-used query results ignores a good deal of available information
that often suggests better things to throw out. In addition, understanding
query sequences pays off in more cooperative responses to queries,
identification of previously unrecognizable user semantic errors, and support
for new kinds of querying.

This paper synthesizes work summarized in (Rowe, 1984) with some new ideas.
Many of the details (especially mathematical) omitted here may be found in
that paper.

This paper is to appear in the Proceedings of the Workshop on Expert
Database Systems, Kiawah Island, South Carolina, October 1984.



2. Some definitions 



Analysis of a long query sequence as a unit is too hard. Our idea is thus to
simplify the problem to the study of pairs of user queries, to see if we can
recognize preference phenomena between the results of each query, including
both differences in the resulting set compositions and attributes. We will
identify "preferences" between such pairs (Luce, 1959), yes/no phenomena
that can, however, be quantified by a certainty factor in the manner of
rule-based systems such as MYCIN (Buchanan and Shortliffe, 1983). We will
postpone details of the certainty factors to section 4.



By "query set" we will mean a set of some items represented in a database,
items with the same data-type, and with some associated attributes of those
items. We use "predictive power" to measure preference phenomena. That is,
a user "prefers" one item in a database to another item if he is more likely to
include that item (directly or indirectly) in future query sets. Preference
between query sets, however, is more complicated and more "psychological"
because it must take into account the sizes of the query sets as well as the
total likelihood that some member of the set will be included in a future query
set, for otherwise a set would always be preferred to its subset. In other words,
we prefer query set i to query set j if

$$p_i + \alpha \frac{p_i}{N_i} > p_j + \alpha \frac{p_j}{N_j}$$

where $p_i$ is the probability that some member of set i will be included in a
future query set, $N_k$ the size of set k in items, and $\alpha$ a constant. (See section
5.3 for more details.)



3. Heuristics for query-output preference 

We now suggest some heuristics that may be used to determine the
preferences and their certainty factors, approximately in order of decreasing
strength. We assume a relational database. We identify three basic kinds of
tasks a user might be pursuing in using the database query system (see (Miller,
1969) for more discussion): (1) choosing something from among options for
real-world action; (2) generating a report for management on some important
part of the database; and (3) preparing a statistical analysis of significant
phenomena in the data. We write user queries in standard font, system
responses in italics.



3.1. Real-world connection 

Which suppliers of widgets are located in California?

There are 33 names. 

Send to the printer the address, location, and price of those 
with the ten cheapest prices. 



If a user employs some query set in the real world in preference to another,
that usually means he prefers the former to the latter. The main difficulty is
figuring out what happened in the real world. This depends on how policies
decided with the help of the database are exhibited later in the data. If the
database is the actual tool used to accomplish things (as when an order can be
made merely by adding an order record to an order relation), one can merely
search that relation to find out what happened. If the user's goal is only to
prepare a report, then sending records or statistics on some query set off to a
printer or special graphics device suggests that query set is preferred to
another set not so treated.



3.2. Implementational "handles"

How many suppliers of widgets are located in California? 

There are 33. 

List the names of the ten nearest. 

<listing> 

If a specific real-world action is implemented from database information (as
opposed to statistical analysis), "handles" on the data are often necessary.
That means something that links the database records with real-world entities,
like a name, identification number, or even some unique description. A query
set with this information is preferable to one without.



3.3. Implementation preconditions 



List the names of widget suppliers located in California.
<33 names>

Give their addresses and phone numbers. 

<listing> 



A related but distinct issue to the last heuristic is the preconditions that are 
often necessary for real-world implementation. Often just unique
identification is enough, but certain unique identifiers make action much 
easier than others. You can usually phone a person knowing only their name, 
but knowing the phone number speeds things considerably. 



3.4. Distinguishability

List the widget-like products supplied by California widget suppliers. 
<65 listings, but "widget" the only product listed> 

List the sizes of widgets supplied by California widget suppliers. 
<listing with seven different size values> 



If the user’s goal is to choose something from the database, there has to be a 
basis for a choice. That is, there must be different values displayed for some
attribute of some of the items in a query set. A set without such
"distinguishability" (as the first set above) is less desirable than one that has it 
(as the second set above, which gives different sizes for different values). 



3.5. Expletives 



List the California widget suppliers and their addresses. 

<listing of 33 names and addresses> 

Great. What are their prices on crates of widgets? 

<listing of 33 prices> 

Ugh. What about prices on suppliers in other Western states? 

A natural language query environment has some important advantages over a
formal query language environment (e.g., SQL or QUEL): non-goal, thematic
information can be exploited. Human information-seeking conversation
contains many kinds of evaluative cues, and users could be encouraged to
volunteer them to a database query system too. Many of these cues are
one-word expletive tags on the beginning of sentences, and are easy to parse. Often
their meaning is quite clear. "Good", "ok", "fine", "great", "swell", and
"amazing" denote positive preferences to query results not so tagged.
Similarly, "ugh", "argh", "oh no", "bad", "stop", and assorted profanity denote
negative preferences. Some expletives are unclear, as "hmm", "wait", and
"funny". Note that these cues occur after the query result, so they are found
in a different place than evidence for other heuristics.
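
A minimal sketch of how such leading tags might be picked off an utterance. The function name `leading_cue` and the numeric polarity encoding are assumptions made for illustration; the word lists use only examples given in the text (profanity is omitted), and the tokenization is deliberately naive.

```python
import re

# Word lists taken from the examples in the text; a real system would need more.
POSITIVE = {"good", "ok", "fine", "great", "swell", "amazing"}
NEGATIVE = {"ugh", "argh", "oh no", "bad"}
UNCLEAR = {"hmm", "wait", "funny"}


def leading_cue(utterance):
    """Split a leading expletive tag from the rest of an utterance.

    Returns (polarity, rest): polarity is +1, -1, 0 (an unclear cue),
    or None when the utterance does not start with a known tag.
    """
    # The candidate tag is the run of letters before the first '.', '!' or ','.
    match = re.match(r"\s*([A-Za-z ]+?)\s*[.!,]\s*(.*)", utterance, re.S)
    if not match:
        return None, utterance.strip()
    tag, rest = match.group(1).lower(), match.group(2)
    if tag in POSITIVE:
        return +1, rest
    if tag in NEGATIVE:
        return -1, rest
    if tag in UNCLEAR:
        return 0, rest
    return None, utterance.strip()


if __name__ == "__main__":
    for line in ["Great. What are their prices on crates of widgets?",
                 "Ugh. What about prices on suppliers in other Western states?",
                 "List the names of the ten nearest."]:
        print(leading_cue(line))
```

Because the cue follows the result it evaluates, a caller would attach the returned polarity to the result of the preceding query, not to the query that carries the tag.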



3.6. Repeated mention 

What suppliers of widgets are in California?

<33 names> 

What are their addresses and prices per crate?
<33 addresses and prices>



Repeated references to the same set suggest that set is preferred to others. 
Each repetition increases the likelihood. 



3.7. Lateness of use

What suppliers of widgets are in California?
<33 names>

OK, bye. 



People usually stop searching when they find what they are looking for. If the
goal of a query session is to find something (as opposed to explore or browse),
then the last query usually gives the information the user most prefers. Other
query sets late in a session may be similar (the queries following them may
just be "doublechecking"), so being late in a session should have weight too.
This heuristic has similarities to the least-recently-used priority method, but it
need not be so linear: it usually doesn't matter much whether something was
used 10 queries ago or 15, since neither result has been used in a long while
and is not likely to be relevant.



3.8. Subsetting



What suppliers of widgets are in California?

<33 names>

Which have widgets under $20 a crate?

<16 names>

Which have widgets $20 to $30 a crate?

<18 names>

When subsets of a query set are subsequently taken, it suggests the set is more
important than those not so treated. Each additional subset increases the
preference of the set. Note the subset must occur after the set for this to be
meaningful. The phenomenon is similar to that with repeated references to
the same set, but not as strong in establishing preference.



3.9. Attribute exhaustion 

Tell me everything you know about widgets from California suppliers. 



If the database has supplied everything it has about a query set, a user is less
likely to ask future questions about that set as opposed to some other. But the
user might repeatedly ask for the same data when trying to prepare a table or 
a report for presentation. 



3.10. Statistical interest



What are the widget suppliers in California?

<33 names>

What fraction of the widget suppliers are currently manufacturing frobs?
<all but one> 



If there is a statistical-analysis aspect to the user’s task, any unexpected 
results make a set more interesting than those without such results. One can 
define "unexpected" as the degree to which counts and sums cannot be
predicted from independence-assumption models and other linear models, or
one can exploit detailed causal-relationship models (Blum, 1982).



3.11. Small query sets 

What are the widget suppliers in California?
<33 names>

Which are within 10 miles?

<5 names>



If users are trying to choose a single object, they tend to narrow possibilities
progressively. If so, the small (but nonempty) sets towards the end of the
query session are preferable, when the most factors have been taken into
account.



4. Putting the heuristics together 



4.1. Certainty factors 

We now suggest some reasonable quantitative rankings of the above heuristics,
for the three types of user tasks mentioned. We assume the user's task is
known beforehand. For these certainty factors, larger numbers mean greater
certainty, and the numbers are scaled with 1.0 the largest (to suggest
probabilities).



 n   Heuristic name           Choice task   Report gen.   Stat. analysis
 1   Real-world connection    1.0           1.0           1.0
 2   Implementation handles   .9            .9            .9
 3   Impl. preconditions      .6            na            na
 4   Distinguishability       .6            .8            na
 5   Expletives               .6            .6            .6
 6   Repeated mention         .4            .5            .5
 7   Lateness of use          .6            .6            .5
 8   Subsetting               .3            .3            .4
 9   Attribute exhaustion     .5            .1            .1
10   Statistical interest     na            .2            .7
11   Small query sets         .4            .1            .1



4.2. Combining certainty factors 

We can evaluate any pair of query results in the output for a user query
session, and assign a subset of the above certainty factors (chosen from the
appropriate column) to the pair. To get a cumulative certainty factor, the
much-used independence model of MYCIN (Buchanan and Shortliffe, 1983)
seems reasonable:

$$CF_{total} = 1 - \prod_i (1 - CF_i)$$
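
As a worked instance of this independence combination (one minus the product of the complements of the individual factors), suppose a pair is supported by two pieces of evidence with certainty factors .6 and .3; the two values are chosen here only for illustration:

```latex
CF_{total} = 1 - (1 - .6)(1 - .3) = 1 - (.4)(.7) = .72
```

The combined certainty is larger than either factor alone, and it can approach but never exceed 1.0 however many factors are combined.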



We can then build a directed graph with nodes representing the query results,
and with numbers associated with each edge representing the certainty factor
in the preference of one result to another. In general, only a few of the
possible connections may be made in the graph, so usually we do not have
direct evidence of which of two specific sets is preferable. But we can assume
that preference is transitive: if A is preferred to B, and B to C, A should be
preferred to C. To compute the certainty factor for such implied preferences,
we don't want to use an independence assumption as we did in the formula
above because clearly the preferences are not independent (they refer to
common nodes), so instead we suggest the "conservative" approach taken by
fuzzy logic and make the total preference the largest of the preferences in the
path.



4.3. An example 



Suppose we have the following query session. Again, the system responses are
in italics. We number queries and responses to more easily refer to them.

1. How many tankers are in the Mediterranean? 

2. 37. 

3. List the American ones. 

4. Titanic, Bounty, Pequod, Lusitania, Pueblo, Mayaguez.

5. Give the tonnages for those more than 1000 feet. 

6. None are that long. 

7. Give the tonnages and positions for those over 500 feet. 

8.

Ship     Tonnage   Position
Bounty   14000     40N13E
Pequod   8000      45N5E
Pueblo   17000     43N18E



9. Good. What are the captain and radio call sign of the Pequod? 

10. Ahab and WHL. 

11. And who owns it? 

12. Peleg Enterprises. 

Assuming that the user is following a choice task: 

Real-world connection (Heuristic 1) does not apply. 

The query sets in queries 3, 5, 7, 9, and 11 are preferred to that of query 1 
by the Implementation Handles Heuristic (Heuristic 2). 

Query set 9 is preferred to 1, 3, 5, and 7 by the Implementational 
Preconditions Heuristic (Heuristic 3). 

Sets 3, 5, 7, 9, and 11 are preferred to 1 by Distinguishability (Heuristic 4).

Set 7 is preferred to all others by the Expletives Heuristic (Heuristic 5). 

Set 11 (= set 9) is preferred to 1, 3, 5, and 7 by Repeated Mention
(Heuristic 6).

Set 11 (= set 9) is also preferred to 1, 3, 5, and 7 by Lateness of Use 
(Heuristic 7). 

Heuristic 8 (Subsetting) applies to every set except 5, causing a preference 
of each set to its successors. 

Heuristic 9 (Attribute Exhaustion) would apply to set 11 (= set 9) only if 
name, tonnage, position, captain, owner, and radio call sign were the only
things known about ships. Let us assume this is not true. 

Heuristic 10 (Statistical Interest) does not apply. 

Set 11 (= set 9) is preferred to 1, 3, 5, and 7 by the Small Query Sets 
Heuristic (Heuristic 11). 




We can create a 5 by 5 matrix representing the preference graph; the entries
collect the certainty factors (if any) for the preference of that row to that
column.

              set 1                  set 3            set 5            set 7            set 9 (=11)
set 1         -                      -                -                -                -
set 3         .9,.8,.3               -                -                -                -
set 5         .9,.8,.3               .3               -                -                -
set 7         .9,.8,.6,.3            .6,.3            .6,.3            -                .6
set 9 (=11)   .9,.8,.8,.4,.6,.3,.4   .8,.4,.6,.3,.4   .8,.4,.6,.3,.4   .8,.4,.6,.3,.4   -





Using the combination formula, the total certainty factors for pairs of query 
sets are as follows: 





              set 1   set 3   set 5   set 7   set 9 (=11)
set 1         -       0       0       0       0
set 3         .986    -       0       0       0
set 5         .986    .300    -       0       0
set 7         .994    .720    .720    -       .600
set 9 (=11)   1.000   .980    .980    .980    -
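
The totals can be recomputed mechanically from the factor lists in the first matrix. The following is a sketch only: the names `EVIDENCE`, `combine`, and `total_matrix` are illustrative, and the factor lists are transcribed from the matrix of individual certainty factors given for this example.

```python
from math import prod

SETS = ["1", "3", "5", "7", "9"]  # "9" stands for set 9 (= set 11)

# EVIDENCE[(row, col)] lists the certainty factors for "row is preferred to col",
# transcribed from the first matrix of the example.
LATE = [0.8, 0.4, 0.6, 0.3, 0.4]
EVIDENCE = {
    ("3", "1"): [0.9, 0.8, 0.3],
    ("5", "1"): [0.9, 0.8, 0.3],
    ("5", "3"): [0.3],
    ("7", "1"): [0.9, 0.8, 0.6, 0.3],
    ("7", "3"): [0.6, 0.3],
    ("7", "5"): [0.6, 0.3],
    ("7", "9"): [0.6],
    ("9", "1"): [0.9, 0.8] + LATE,
    ("9", "3"): LATE,
    ("9", "5"): LATE,
    ("9", "7"): LATE,
}


def combine(factors):
    """Independence combination: 1 minus the product of (1 - CF)."""
    return 1.0 - prod(1.0 - cf for cf in factors)


def total_matrix(evidence, sets):
    """Combined certainty for every ordered pair; 0 where there is no evidence."""
    return {(r, c): combine(evidence.get((r, c), []))
            for r in sets for c in sets if r != c}


if __name__ == "__main__":
    totals = total_matrix(EVIDENCE, SETS)
    for r in SETS:
        cells = ["-" if r == c else f"{totals[(r, c)]:.3f}" for c in SETS]
        print(f"set {r}: " + "  ".join(cells))
```

An empty factor list combines to 0, which matches the convention of writing 0 for pairs with no evidence.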



5. More careful user modelling with decision theory 

These certainty factors are crude, however, and only take into account 
superficial aspects of what a user is trying to do. They are somewhat robust, 
however, and can apply to a broad range of tasks. But if we are willing to
narrow our focus somewhat, we can do better. One approach explored in detail
in (Rowe, 1984) is to model certain choice tasks using detailed decision-theory
models. 



5.1. Utilities and suitabilities for choice tasks 



When a transportation planner is choosing a vehicle to carry a load somewhere, 
many factors must be taken into account. Some have to do with the costs 
associated with alternatives, others with the availability and reliability of 
options. Following decision theory, we call the former utilities, and the latter
probabilities, though the latter are also a special kind of probability, which for
lack of a term we call suitabilities. As an example, in merchant
shipping the utilities are the financial cost of loading a ship; the fuel, crew
wages, and miscellaneous transit costs for a voyage; and the time delay in
getting a cargo to its destination. Suitabilities are the ability of a ship to carry 
a particular kind of cargo; and the ability of a ship (due to its dimensions) to 
be serviced at a particular port. 



We can sum up sub-utilities to get a total utility $U_j$ for an option, and multiply
sub-suitabilities (making a reasonable independence assumption) to get a total
suitability $S_j$. Then using a simple psychological model of how people make
choices we have a formula for the probability $p_i$ of absolute preference of item
i to all other items in a set of n items:



$$p_i = \frac{\ldots}{1 - .5 S_i} \prod_{j=1}^{n} \left[ 1 - S_j \, \Phi\!\left( \frac{\ldots}{\sigma_{indiff}} \right) \right]$$

where $\Phi$ is the integral of the unit normal curve about zero. Issues related
to this are discussed in (Luce, 1959).

As an example, consider four items with suitability-utility pairs (.6,20), (.5,15),
(.1,10), (.2,15), and suppose $\sigma_{indiff}$ is 5. Then:

p1 = .29, p2 = .52, p3 = .87, p4 = .43

which we can normalize to:

p1 = .14, p2 = .25, p3 = .41, p4 = .20

So the third item is the most preferred, with a probability around .41. 
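
The normalization step is simple arithmetic and can be checked directly. This sketch takes the four unnormalized values quoted above as given (it does not recompute them from the preference formula); the function name `normalize` is illustrative.

```python
def normalize(scores):
    """Scale nonnegative scores so they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


raw = [0.29, 0.52, 0.87, 0.43]           # unnormalized values from the example
normalized = normalize(raw)
rounded = [round(p, 2) for p in normalized]
# 1-based index of the most preferred item
best = max(range(len(normalized)), key=normalized.__getitem__) + 1

if __name__ == "__main__":
    print(rounded, "most preferred item:", best)
```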



5.2. Slot filling 

To provide some generality in our decision theory model, so we don't have to
write separate formulae for every different transportation situation, we provide
"slots" (in the sense of slots in frames) that are instantiated dynamically from
the query sequence. These then force a choice among a class of similar utility
and suitability assignments. Four common slots for choice tasks are:

1. Domain of discourse. These are restrictions mentioned repeatedly, that
essentially limit non-negligible suitabilities to only certain categories. For
example, the user asks repeatedly about subsets of American tankers.

2. Reference standard. These give a point against which objects are
measured. For example, if the user asks for ships within 100 nautical miles
of Naples, this suggests that Naples is the reference standard for location,
and also suggests the user contemplates arriving there or departing from
there.

3. Threshold values. These give criterion values suggesting exact quantities
involved in real-world activities. For example, if the user asks for ships with
more than 10000 ton capacity, it suggests he wants to transport 10000 tons.

4. Answer set size. These indicate the usual size of a query answer for a
particular user, and suggest the degree to which the choice problem is
focussed (the $\alpha$ discussed below).



5.3. Evaluating sets 

We now have a way to evaluate and compare sets: these probabilities for the
items in the set. Following as before the formula for the probability union
assuming independence, we can give a cumulative number for the desirability
of at least one item in a set:

$$D = 1 - \prod_{i=1}^{n} (1 - p_i)$$

But following the lead of information retrieval (Lancaster, 1979), this is
misleading, for it would mean that larger sets would tend to be the most
desirable. If a set is large, there may be many items with very low
desirabilities, and this seems unfair. What we really want for total desirability
is some weighted sum of the above D and D/n, the density of desirability in the
set. The exact weighting $\alpha$ can be user-dependent; see the discussion of user
hierarchies below.



So to compare two sets, we just compute

$$E = D + \frac{\alpha D}{n}$$

where D is given by the preceding formula; the set with the larger E is
preferable.
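
A minimal sketch of this comparison, assuming that D is the independence union of the item probabilities and that the weighted sum takes the form E = D + alpha * D / n. The function names and the item probabilities are hypothetical, chosen to show how the density term penalizes a set padded with low-desirability items.

```python
from math import prod


def desirability(item_probs):
    """D: probability that at least one item is wanted, assuming independence."""
    return 1.0 - prod(1.0 - p for p in item_probs)


def set_score(item_probs, alpha):
    """E = D + alpha * D / n (assumed form of the weighted sum); 0 for an empty set."""
    n = len(item_probs)
    if n == 0:
        return 0.0
    d = desirability(item_probs)
    return d + alpha * d / n


# Hypothetical item probabilities: the large set is the small one plus filler.
small = [0.5, 0.4]
large = [0.5, 0.4, 0.01, 0.01, 0.01, 0.01]

if __name__ == "__main__":
    for alpha in (0.0, 1.0):
        print(alpha, set_score(small, alpha), set_score(large, alpha))
```

With alpha set to zero the larger set always scores at least as high as its subset; with a positive alpha the smaller, denser set can win.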



6. Coordinating the two approaches to preferences 

We have presented two rather different approaches to "parsing" of query
sequences: one superficial but broadly applicable, and one deeper but
complicated and requiring formalization of many subjective judgements. The
obvious question is whether the two can be reconciled. We think so, on the
basis of some preliminary experiments.

We applied the two methods to the same query sequences and compared 
results. Since the heuristic method is simpler, we used it as the standard, and 
adjusted parameters of the decision-theory approach until its preference
results came out the same as often as possible. The parameters we adjusted
were the weights on individual sub-utilities and sub-suitabilities (that is, their
importance to the total utility or total suitability), though in a more
comprehensive approach we could include other kinds of parameters like the $\alpha$
mentioned above, which we assumed was zero.

We treated preferences as binary for this analysis; that is, if the certainty was
greater than a criterion we assumed a positive preference, else no preference.
We then set up inequalities representing matchings of pairs of items, one
drawn from each set, with the inequality sign in the direction of the
preference. The paired items had to be distinct, so we removed common items
from the two sets. This gives a large set of linear inequalities, most of which
are redundant (derivable from others), so we eliminated the redundancies to
get a small set of irredundant inequalities. We can also get inequalities from
two other sources:

1. from past user behavior, suggesting reasonable bounds on utility and 
suitability values 

2. from the query sequence directly, when a query set is a subset of the
immediately previous query set. Since subsetting is evidence of
preference, it suggests that the additional restrictions correspond to 
utilities or suitabilities that are better than those for items in the 
complement of the restriction. For instance, if the user next asks for the 
American ships in a set, it suggests that the nationality suitability for 
American ships is greater than the suitability for any other nationality. 



6.1. "Solving" the inequalities

Our inequalities do not represent absolute criteria, only evidence. All we want
is a representative point in the hyperregion defined by them. Thus we do not
want to "solve" them per se, just find an answer most consistent with them. We
can treat this as an optimization problem with "penalty functions"
corresponding to the inequalities (Gill, Murray, and Wright, 1981): the fewer
inequalities violated, the larger the value of the optimization function. Such a
function can be expressed as follows:

$$f = \prod_{j=1}^{N} \Phi\!\left( \frac{d_j}{\sigma_{indiff}} \right)$$

where N is the number of inequalities, $\sigma_{indiff}$ a measure of sensitivity to
inequality violation, and $d_j$ the difference of the right side from the left side of
the jth inequality, where the inequality sign is turned if necessary to point
right. And as before, $\Phi$ is the integral of the unit normal curve about zero.

7. Defaults from user hierarchies 

Users will differ a good deal in parameters for the decision theoretic model. 
They will also differ to a lesser extent in the certainty factors they would judge 
appropriate for the heuristic model. So user modelling is important for both 
approaches. But users form groups and subgroups based on people, time, and
task. Multiple inheritance of quantitative parameters in the manner of (Rowe,
1982) can be used to provide starting defaults for the parameters, which can
then be adjusted using the methods of the last section among others.
Adjustments can then be averaged into the defaults for the groups to which the
user belongs. To make this work, the user might be asked to critique default 
group assignments at the beginning of each session. 



8. Applications 

We can use preferences among query sets or query items for several purposes:

1. We can develop "knowledge-based temporaries management", intelligent
deciding of which of the previous query results to save and which to throw
away when space is needed.

2. We can prefetch data the user is likely to need, in the perhaps otherwise
wasted time while he is examining one query result and deciding what to
ask next.

3. With a decision-theory model of item preference, we can make
suggestions to the user for what restrictions in a query to relax when a
query set is empty or too small.

4. We can notice new classes of user errors. With a decision theory model,
we can point out incorrect parts of a query expression, or necessary
missing parts, that would lead to fetching of items with very low preference.

5. We can handle queries with vague or fuzzy restrictions, since the
preference mathematics is probabilistic anyway.
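
The first application above can be sketched as a preference-ordered eviction policy in place of least-recently-used. Everything here is an assumption made for illustration: the text does not say how a single preference score per stored result is derived from pairwise certainties, so the scores, sizes, and the function name `choose_evictions` are hypothetical.

```python
def choose_evictions(cache, space_needed):
    """Pick stored query results to discard, least preferred first.

    cache: list of (name, preference, size) tuples, where preference is a
           hypothetical score (higher means more worth keeping).
    Returns the names to evict, stopping once space_needed units are freed.
    """
    victims, freed = [], 0
    for name, _preference, size in sorted(cache, key=lambda entry: entry[1]):
        if freed >= space_needed:
            break
        victims.append(name)
        freed += size
    return victims


if __name__ == "__main__":
    # Hypothetical scores and sizes for four stored results.
    cache = [("set 1", 0.0, 40), ("set 3", 0.3, 10),
             ("set 7", 0.72, 5), ("set 9", 0.98, 2)]
    print(choose_evictions(cache, 45))
```

Unlike a recency ordering, this keeps a small, highly preferred result even if it was produced early in the session.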



9. Conclusion 

It is difficult to evaluate man-machine interface innovations, and this area is
no exception. We have proposed two approaches to a new and virtually
uncharted area. We have tried to justify carefully the steps we have taken, but
only detailed experimentation and further study with these approaches will
provide the final judgement.